A heuristic approach to effective and efficient clustering on uncertain objects
Abstract
We study the problem of clustering uncertain objects whose locations are described by probability density functions (pdf). We analyze existing pruning algorithms and show experimentally that they introduce a new performance bottleneck: the overhead of pruning candidate clusters when assigning each uncertain object in each iteration. We show that, under squared Euclidean distance, UK-means (without pruning techniques) reduces to K-means and runs much faster than the pruning algorithms, although its clustering results differ somewhat because a different distance function is used. We therefore propose Approximate UK-means, which heuristically identifies objects that are boundary cases and reassigns them to better clusters. We propose three models of the cluster representative (certain model, uncertain model and heuristic model) for calculating the expected squared Euclidean distance between objects and cluster representatives. Our experimental results show that, on average, the execution time of Approximate UK-means is only 25% more than that of K-means, and that our approach reduces the discrepancies in K-means' clustering results by up to 70%.
While there has been a large amount of research on mining and querying relational databases [1], the focus has been on databases that store exact values. In many real-life applications, however, the raw data (for example, sensor data) are not precise or accurate when they are collected or produced. The clustering of such "uncertain" data can be illustrated by a simple, realistic example. Consider sensors on wild animals that update their locations periodically: the sampled locations of an animal over a period generate a (discrete) probability distribution function (PDF) describing the possible locations of the animal. Clustering those animals may reveal the possible groups and the interactions between them.
In our work, we consider the problem of clustering objects with multidimensional uncertainty where an object is represented by an uncertain region over which a discrete probability distribution function (PDF) or a probability density function (pdf) is defined.
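The reduction of UK-means to K-means under squared Euclidean distance rests on a standard identity: for an object with sample locations s_1, ..., s_n and mean m, the average of ||s_i - c||^2 equals ||m - c||^2 plus a spread term that does not depend on the cluster representative c. The sketch below checks this numerically in plain Python. It assumes each uncertain object is a uniformly weighted discrete PDF over sample points and that cluster representatives are exact points (roughly the "certain model"); all function names and the synthetic data are hypothetical illustrations, and this is not the paper's Approximate UK-means.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(samples: Sequence[Point]) -> Point:
    """Component-wise mean of an object's sample locations."""
    n = len(samples)
    return tuple(sum(s[d] for s in samples) / n for d in range(len(samples[0])))

def expected_sq_dist(samples: Sequence[Point], c: Point) -> float:
    """Expected squared distance to c under a uniform discrete PDF (assumption)."""
    return sum(sq_dist(s, c) for s in samples) / len(samples)

def spread(samples: Sequence[Point]) -> float:
    """Mean squared distance of the samples to their own mean; independent of c."""
    m = mean_point(samples)
    return sum(sq_dist(s, m) for s in samples) / len(samples)

def argmin(costs: Sequence[float]) -> int:
    return min(range(len(costs)), key=lambda i: costs[i])

def assign_by_expected_distance(objects, centers) -> List[int]:
    """UK-means-style assignment using expected squared distance."""
    return [argmin([expected_sq_dist(o, c) for c in centers]) for o in objects]

def assign_by_mean(objects, centers) -> List[int]:
    """K-means-style assignment using only each object's mean location."""
    return [argmin([sq_dist(mean_point(o), c) for c in centers]) for o in objects]

def make_objects(seed: int = 0, n_objects: int = 5, n_samples: int = 50):
    """Synthetic 2-D uncertain objects (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    objects = []
    for _ in range(n_objects):
        cx, cy = rng.uniform(0, 10), rng.uniform(0, 10)
        objects.append([(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
                        for _ in range(n_samples)])
    return objects

objects = make_objects()
centers = [(2.0, 2.0), (8.0, 8.0), (5.0, 1.0)]  # hypothetical representatives

for obj in objects:
    for c in centers:
        # Identity: E||X - c||^2 = ||E[X] - c||^2 + spread(X)
        lhs = expected_sq_dist(obj, c)
        rhs = sq_dist(mean_point(obj), c) + spread(obj)
        assert abs(lhs - rhs) < 1e-9

print(assign_by_expected_distance(objects, centers))
print(assign_by_mean(objects, centers))
```

Because the spread term is the same for every candidate cluster, the two assignment functions pick the same cluster for each object. Under expected (non-squared) Euclidean distance no such decomposition holds, which is consistent with the discrepancies in clustering results that the abstract attributes to the different distance functions.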
Similar resources
A Clustering Approach by SSPCO Optimization Algorithm Based on Chaotic Initial Population
Assigning a set of objects to groups such that objects in one group or cluster are more similar to each other than to the objects of other clusters is the main task of clustering analysis. The SSPCO optimization algorithm is a new optimization algorithm inspired by the behavior of a type of bird called the see-see partridge. One of the things that smart algorithms are applied to solve is the problem ...
A Heuristic on Effective and Efficient Clustering on Uncertain Objects
We study the problem of clustering uncertain objects whose locations are uncertain and described by probability density functions. We analyze existing pruning algorithms and experimentally show that there exists a new bottleneck in the performance due to the overhead while pruning candidate clusters for assignment of each uncertain object in each iteration. We further show that by considering s...
Clustering and Classification on Uncertain Data
We study the problem of mining on uncertain objects whose locations are uncertain and described by probability density functions (pdf). Clustering and classification are two important tasks in data mining. Clustering on uncertain objects is different from traditional case on certain objects. UK-means is proposed based on K-means but it is time consuming. Pruning techniques are proposed to impro...
A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS
Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Because clustering algorithms are used so widely in many fields, a great deal of research continues into finding the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of cluster centers and hence can become trapped in a local optimum. In this paper...
A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach
In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge, complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...
Journal title:
- Knowl.-Based Syst.
Volume 66, Issue
Pages -
Publication date 2014